Predictive Modeling Discussions¶

  • a. Are you working on a REGRESSION or CLASSIFICATION problem?

    The given problem can be approached as either a REGRESSION or a CLASSIFICATION problem. I have decided to approach it as a classification problem by categorizing tracks with popularity <= 50 as 0 (Unpopular) and tracks with popularity > 50 as 1 (Popular) in a new column called track_popularity_bin. The goal is to build a classification model that classifies a track as 0 or 1 based on its characteristics
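    The binarization rule can be sketched as follows (toy scores stand in for df['track_popularity']; the same np.where call appears later in the notebook):

```python
import numpy as np
import pandas as pd

# Toy popularity scores standing in for df['track_popularity']
pop = pd.Series([10, 49, 50, 51, 88])

# Binarize: strictly greater than 50 -> 1 (popular), otherwise 0 (unpopular)
pop_bin = np.where(pop > 50, 1, 0)
print(pop_bin.tolist())  # [0, 0, 0, 1, 1]
```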

  • b. Which variables are inputs?

    The following variables are the final input variables I identified after EDA

    1. danceability
    2. energy
    3. key
    4. loudness
    5. mode
    6. speechiness
    7. acousticness
    8. instrumentalness
    9. liveness
    10. valence
    11. tempo
    12. duration_ms
  • c. Which variables are responses/outputs/outcomes/targets?

    • track_popularity_bin is the target variable
  • d. Did you need to DERIVE the responses of interest by SUMMARIZING the available data?

    • Yes
  • e. If so, what summary actions did you perform?

    • Grouped the songs with track_popularity <= 50 as 0 and track_popularity > 50 as 1 in a new column called track_popularity_bin
  • f. Which variables are identifiers and should NOT be used in the models?

    1. track_id
    2. track_album_id
    3. playlist_id
    4. track_name
    5. playlist_name
    6. track_artist
  • g. Important: Answer this question after completing parts C and D. Return to this predictive modeling discussion section to answer the following:

    i. Which of the inputs do you think influence the response, based on your exploratory visualizations? Which exploratory visualization helped you identify potential input-to-output relationships? (If you are not sure which inputs seem to influence the response, it is okay to say so.)

    Answer: The following visualizations helped identify potential input-to-output relationships

    1. Conditional Distribution of continuous variables GROUPED BY the response (target) variable
    2. Relationships between continuous variables GROUPED BY the response (target) variable
    3. Conditional Distribution of continuous variables GROUPED BY the response (target) variable and additional categorical variable

    Inputs that influence the response: continuous variables that represent the characteristics of a track (danceability, energy, loudness, valence, tempo, etc.) influence the response (target) variable

Import Modules¶

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
In [ ]:
sns.set_palette("colorblind")

Loading the Dataset¶

The following step loads the dataset from the given URL into a pandas dataframe named df

In [ ]:
songs_url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv'

df_main = pd.read_csv(songs_url)

# Creating a copy. Keeping the main df intact in case needed for further analysis
df = df_main.copy()

Basic Info about the dataset¶

First, I find the dimensionality of the pandas dataframe using the df.shape attribute. This tells how many rows and columns are in the dataframe; in this case, 32833 rows and 23 columns

In [ ]:
df.shape
Out[ ]:
(32833, 23)

Then, I explore the datatypes and the count of non-null values in every column of the dataset using the df.info() method

In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32833 entries, 0 to 32832
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_id                  32833 non-null  object 
 1   track_name                32828 non-null  object 
 2   track_artist              32828 non-null  object 
 3   track_popularity          32833 non-null  int64  
 4   track_album_id            32833 non-null  object 
 5   track_album_name          32828 non-null  object 
 6   track_album_release_date  32833 non-null  object 
 7   playlist_name             32833 non-null  object 
 8   playlist_id               32833 non-null  object 
 9   playlist_genre            32833 non-null  object 
 10  playlist_subgenre         32833 non-null  object 
 11  danceability              32833 non-null  float64
 12  energy                    32833 non-null  float64
 13  key                       32833 non-null  int64  
 14  loudness                  32833 non-null  float64
 15  mode                      32833 non-null  int64  
 16  speechiness               32833 non-null  float64
 17  acousticness              32833 non-null  float64
 18  instrumentalness          32833 non-null  float64
 19  liveness                  32833 non-null  float64
 20  valence                   32833 non-null  float64
 21  tempo                     32833 non-null  float64
 22  duration_ms               32833 non-null  int64  
dtypes: float64(9), int64(4), object(10)
memory usage: 5.8+ MB

Also, below is the description of each of the column in the dataset

variable class description
track_id character Song unique ID
track_name character Song Name
track_artist character Song Artist
track_popularity double Song Popularity (0-100) where higher is better
track_album_id character Album unique ID
track_album_name character Song album name
track_album_release_date character Date when album released
playlist_name character Name of playlist
playlist_id character Playlist ID
playlist_genre character Playlist genre
playlist_subgenre character Playlist subgenre
danceability double Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy double Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key double The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness double The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
mode double Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness double Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness double A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness double Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness double Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence double A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo double The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms double Duration of song in milliseconds

Analyzing target/output variable track_popularity¶

The plots below show that track_popularity is mostly normally distributed, but the value 0 has far more entries than other values. We can further confirm this using the boxplot

Analysis¶

In [ ]:
df.track_popularity.describe()
Out[ ]:
count    32833.000000
mean        42.477081
std         24.984074
min          0.000000
25%         24.000000
50%         45.000000
75%         62.000000
max        100.000000
Name: track_popularity, dtype: float64
In [ ]:
df.track_id.nunique()
Out[ ]:
28356
In [ ]:
sns.boxplot(x=df["track_popularity"], showmeans=True, width=0.2)
Out[ ]:
<Axes: xlabel='track_popularity'>
No description has been provided for this image
In [ ]:
sns.displot(data = df, x='track_popularity', binwidth=5, aspect=1.25)

plt.show()
No description has been provided for this image
In [ ]:
# percentage of tracks with `track_popularity` = 0
print('percentage of tracks with `track_popularity` as 0 = ', np.mean( df.track_popularity == 0 ) * 100, '%')

# percentage of tracks with `track_popularity` = 100
print('percentage of tracks with `track_popularity` as 100 = ', np.mean( df.track_popularity == 100 ) * 100, '%')
percentage of tracks with `track_popularity` as 0 =  8.23257088904456 %
percentage of tracks with `track_popularity` as 100 =  0.0060914324003289375 %

⭐ The above plots reveal that although track_popularity is an integer column, it is not a continuous output; the values are bounded between 0 and 100. Linear regression works best when the output is continuous, so this problem is better approached as a classification problem.

✨ To do that, I will create a new column called track_popularity_bin. Tracks with track_popularity > 50 will be considered 1 (popular) and those <= 50 will be considered 0 (unpopular)

In [ ]:
df['track_popularity_bin'] = np.where( df.track_popularity > 50, 1, 0 )
In [ ]:
df = df.astype({'track_popularity_bin': 'object'})
In [ ]:
df.track_popularity_bin.value_counts(normalize=True)
Out[ ]:
track_popularity_bin
0    0.574757
1    0.425243
Name: proportion, dtype: float64

💡 Although not perfectly balanced, the binary outcome is not overly imbalanced, so conventional classification approaches can be applied.
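As a quick benchmark (my own sketch, not part of the assignment), these proportions imply a majority-class baseline of about 0.57: any classifier worth keeping should beat the accuracy of always predicting class 0. With toy labels standing in for track_popularity_bin:

```python
import pandas as pd

# Toy labels standing in for df['track_popularity_bin'] (proportions illustrative)
y = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Majority-class baseline: always predict the most frequent label
majority = y.value_counts().idxmax()
baseline_accuracy = (y == majority).mean()
print(baseline_accuracy)  # accuracy any real model must exceed
```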

Handle Duplicates¶

The dataframe has 32833 rows, but the above analysis shows that track_id is not unique per row: there are only 28356 unique track_id values, not 32833

Then, I check whether the characteristics of the duplicated tracks vary across their entries

In [ ]:
track_characteristics=['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']
In [ ]:
for tc in track_characteristics:
    print(f'==={tc}===')
    print(df.groupby(['track_id']).\
    aggregate(num_track_pop_values = ('track_popularity', 'nunique'),
              num_charc_values = (tc, 'nunique')).\
    reset_index().\
    nunique())
===danceability===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===energy===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===key===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===loudness===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===mode===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===speechiness===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===acousticness===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===instrumentalness===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===liveness===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===valence===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64
===tempo===
track_id                28356
num_track_pop_values        1
num_charc_values            1
dtype: int64

💡 The above analysis reveals that there is one and only one value for num_track_pop_values and num_charc_values. Thus, each unique track has a single track_popularity value and a single value for each characteristic.

📌 Based on this info, we can remove the duplicate tracks by retaining only the first occurrence of each track_id

In [ ]:
df.drop_duplicates(subset=['track_id'], keep='first', inplace=True)
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 28356 entries, 0 to 32832
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_id                  28356 non-null  object 
 1   track_name                28352 non-null  object 
 2   track_artist              28352 non-null  object 
 3   track_popularity          28356 non-null  int64  
 4   track_album_id            28356 non-null  object 
 5   track_album_name          28352 non-null  object 
 6   track_album_release_date  28356 non-null  object 
 7   playlist_name             28356 non-null  object 
 8   playlist_id               28356 non-null  object 
 9   playlist_genre            28356 non-null  object 
 10  playlist_subgenre         28356 non-null  object 
 11  danceability              28356 non-null  float64
 12  energy                    28356 non-null  float64
 13  key                       28356 non-null  int64  
 14  loudness                  28356 non-null  float64
 15  mode                      28356 non-null  int64  
 16  speechiness               28356 non-null  float64
 17  acousticness              28356 non-null  float64
 18  instrumentalness          28356 non-null  float64
 19  liveness                  28356 non-null  float64
 20  valence                   28356 non-null  float64
 21  tempo                     28356 non-null  float64
 22  duration_ms               28356 non-null  int64  
 23  track_popularity_bin      28356 non-null  object 
dtypes: float64(9), int64(4), object(11)
memory usage: 5.4+ MB

Exploratory Data Analysis¶

Performing essential EDA using pandas methods¶

  1. Missing Values
  2. Unique Values

First, I look at the number of missing values in each column. This is important because columns with too many missing values can be eliminated, as they won't provide much insight about the dataset

In [ ]:
df.isna().sum()
Out[ ]:
track_id                    0
track_name                  4
track_artist                4
track_popularity            0
track_album_id              0
track_album_name            4
track_album_release_date    0
playlist_name               0
playlist_id                 0
playlist_genre              0
playlist_subgenre           0
danceability                0
energy                      0
key                         0
loudness                    0
mode                        0
speechiness                 0
acousticness                0
instrumentalness            0
liveness                    0
valence                     0
tempo                       0
duration_ms                 0
track_popularity_bin        0
dtype: int64

The above analysis shows that only three columns, track_name, track_artist, and track_album_name, have missing values, and only in a very small number of rows. So far, I have not dropped any columns based on missing values.

It is now time to look at the number of unique values in each column. Columns with too many unique values may not be informative or may lead to overfitting, while columns with too few unique values may not provide enough discriminative power.

In [ ]:
df.nunique(dropna=False)
Out[ ]:
track_id                    28356
track_name                  23450
track_artist                10693
track_popularity              101
track_album_id              22545
track_album_name            19744
track_album_release_date     4530
playlist_name                 448
playlist_id                   470
playlist_genre                  6
playlist_subgenre              24
danceability                  822
energy                        952
key                            12
loudness                    10222
mode                            2
speechiness                  1270
acousticness                 3731
instrumentalness             4729
liveness                     1624
valence                      1362
tempo                       17684
duration_ms                 19785
track_popularity_bin            2
dtype: int64

📌 The above analysis reveals that although key and mode are numeric columns, they have only a few unique values. So, we can treat these columns as categorical for analysis purposes
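To make that explicit, the columns could be cast to a categorical dtype (a sketch with a toy frame standing in for df; the notebook itself keeps them numeric and simply treats them as categorical during plotting):

```python
import pandas as pd

# Toy frame standing in for the key and mode columns of df
tmp = pd.DataFrame({'key': [0, 2, 7, 2], 'mode': [1, 0, 1, 1]})

# Cast the low-cardinality numeric columns to pandas' categorical dtype
tmp = tmp.astype({'key': 'category', 'mode': 'category'})
print(tmp.dtypes)
```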

Analyzing and Visualizing Categorical Variables¶

Based on the dataframe info, we can determine that the following columns are categorical variables

  1. track_id
  2. track_name
  3. track_artist
  4. track_album_id
  5. track_album_name
  6. track_album_release_date
  7. playlist_name
  8. playlist_id
  9. playlist_genre
  10. playlist_subgenre
In [ ]:
df.describe(include='object')
Out[ ]:
track_id track_name track_artist track_album_id track_album_name track_album_release_date playlist_name playlist_id playlist_genre playlist_subgenre track_popularity_bin
count 28356 28352 28352 28356 28352 28356 28356 28356 28356 28356 28356
unique 28356 23449 10692 22545 19743 4530 448 470 6 24 2
top 6f807x0ima9a1j3VPbc7VN Breathe Queen 5L1xcowSxwzFUSJzvyMp48 Greatest Hits 2020-01-10 Indie Poptimism 72r6odw0Q3OWTCYMGA7Yiy rap southern hip hop 0
freq 1 18 130 42 135 201 294 100 5401 1583 17850

❌ There are far too many unique values in the columns track_artist, playlist_name, track_album_name, and track_name. This makes these columns not very useful for training a model, and it is not practical to visualize them. Note: regarding visualization, I confirmed with the instructor on the Coursera Discussion Forum that it is not necessary to show visualizations for categorical variables with far too many unique values.


❌ Identifier columns like track_id, track_album_id and playlist_id will also not be very useful

In [ ]:
df.drop(['track_id','track_album_id','playlist_id','track_artist','playlist_name', 'track_name', 'track_album_name'],
        inplace=True,
        axis=1)
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 28356 entries, 0 to 32832
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_popularity          28356 non-null  int64  
 1   track_album_release_date  28356 non-null  object 
 2   playlist_genre            28356 non-null  object 
 3   playlist_subgenre         28356 non-null  object 
 4   danceability              28356 non-null  float64
 5   energy                    28356 non-null  float64
 6   key                       28356 non-null  int64  
 7   loudness                  28356 non-null  float64
 8   mode                      28356 non-null  int64  
 9   speechiness               28356 non-null  float64
 10  acousticness              28356 non-null  float64
 11  instrumentalness          28356 non-null  float64
 12  liveness                  28356 non-null  float64
 13  valence                   28356 non-null  float64
 14  tempo                     28356 non-null  float64
 15  duration_ms               28356 non-null  int64  
 16  track_popularity_bin      28356 non-null  object 
dtypes: float64(9), int64(4), object(4)
memory usage: 3.9+ MB

Create new columns¶

The given data presents us with the opportunity to create additional columns. These new columns may have an impact on the target variable.

In this case, I have used the column track_album_release_date to create two new columns release_year and release_month

In [ ]:
df['track_album_release_date'] = pd.to_datetime(df['track_album_release_date'],  format='mixed')
In [ ]:
df['release_year'] = df.track_album_release_date.dt.year
In [ ]:
sns.catplot(data = df, y='release_year', kind='count', height=10, aspect=2)

plt.show()
No description has been provided for this image

💡 As seen above, the dataset mostly consists of songs released in recent years. This suggests creating one more variable that splits release_year into two buckets: songs released in or after 2010 in one bucket and songs released before 2010 in the other

In [ ]:
df['release_year_bin'] = np.where( df.release_year < 2010 , 'older', 'recent')
In [ ]:
df.release_year_bin.value_counts()
Out[ ]:
release_year_bin
recent    20460
older      7896
Name: count, dtype: int64

Next, I created the release_month column

In [ ]:
df['release_month'] = df.track_album_release_date.dt.month
In [ ]:
sns.catplot(data = df, y='release_month', kind='count', height=8, aspect=1.5)

plt.show()
No description has been provided for this image

💡 The majority of songs in the given dataset were released in the month of January

Since track_album_release_date has been split into release_year and release_month, the column can be dropped

In [ ]:
df.drop(['track_album_release_date'], axis=1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 28356 entries, 0 to 32832
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   track_popularity      28356 non-null  int64  
 1   playlist_genre        28356 non-null  object 
 2   playlist_subgenre     28356 non-null  object 
 3   danceability          28356 non-null  float64
 4   energy                28356 non-null  float64
 5   key                   28356 non-null  int64  
 6   loudness              28356 non-null  float64
 7   mode                  28356 non-null  int64  
 8   speechiness           28356 non-null  float64
 9   acousticness          28356 non-null  float64
 10  instrumentalness      28356 non-null  float64
 11  liveness              28356 non-null  float64
 12  valence               28356 non-null  float64
 13  tempo                 28356 non-null  float64
 14  duration_ms           28356 non-null  int64  
 15  track_popularity_bin  28356 non-null  object 
 16  release_year          28356 non-null  int32  
 17  release_year_bin      28356 non-null  object 
 18  release_month         28356 non-null  int32  
dtypes: float64(9), int32(2), int64(4), object(4)
memory usage: 4.1+ MB

💡 Let's visualize additional variables - key and mode

Although key and mode are numeric columns, they have few unique values and can therefore be considered categorical variables

In [ ]:
sns.catplot(data = df, y='key', kind='count', height=5, aspect=2)
Out[ ]:
<seaborn.axisgrid.FacetGrid at 0x7fda0b7e1510>
No description has been provided for this image
In [ ]:
sns.catplot(data = df, y='mode', kind='count', height=2, aspect=2)
Out[ ]:
<seaborn.axisgrid.FacetGrid at 0x7fd9b6c68df0>
No description has been provided for this image

Analyzing and Visualizing Continuous Variables¶

In [ ]:
df.describe()
Out[ ]:
track_popularity danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo duration_ms release_year release_month
count 28356.000000 28356.000000 28356.000000 28356.000000 28356.000000 28356.000000 28356.000000 28356.000000 28356.000000 28356.000000 28356.000000 28356.00000 28356.000000 28356.000000 28356.000000
mean 39.329771 0.653372 0.698388 5.368000 -6.817696 0.565489 0.107954 0.177176 0.091117 0.190958 0.510387 120.95618 226575.967026 2011.054027 6.101813
std 23.702376 0.145785 0.183503 3.613904 3.036243 0.495701 0.102556 0.222803 0.232548 0.155894 0.234340 26.95456 61078.450819 11.229221 3.841027
min 0.000000 0.000000 0.000175 0.000000 -46.448000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 4000.000000 1957.000000 1.000000
25% 21.000000 0.561000 0.579000 2.000000 -8.309250 0.000000 0.041000 0.014375 0.000000 0.092600 0.329000 99.97200 187742.000000 2008.000000 2.000000
50% 42.000000 0.670000 0.722000 6.000000 -6.261000 1.000000 0.062600 0.079700 0.000021 0.127000 0.512000 121.99300 216933.000000 2016.000000 6.000000
75% 58.000000 0.760000 0.843000 9.000000 -4.709000 1.000000 0.133000 0.260000 0.006570 0.249000 0.695000 133.99900 254975.250000 2019.000000 10.000000
max 100.000000 0.983000 1.000000 11.000000 1.275000 1.000000 0.918000 0.994000 0.994000 0.996000 0.991000 239.44000 517810.000000 2020.000000 12.000000

The above table shows basic statistics for the continuous variables

To analyze the continuous variables, I start by creating a new dataframe df_lf that holds the data in LONG FORMAT. This makes the visualizations easier to build with Seaborn

In [ ]:
df_features = df.select_dtypes('number').copy()
df_features.drop(['track_popularity'], axis=1, inplace=True) # Dropping Target variable
In [ ]:
df_objects = df.select_dtypes('object').copy()
In [ ]:
id_cols = ['rowid', 'track_popularity'] + df_objects.columns.to_list()
In [ ]:
df_lf = df.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=id_cols, value_vars=df_features.columns)
In [ ]:
df_lf
Out[ ]:
rowid track_popularity playlist_genre playlist_subgenre track_popularity_bin release_year_bin variable value
0 0 66 pop dance pop 1 recent danceability 0.748
1 1 67 pop dance pop 1 recent danceability 0.726
2 2 70 pop dance pop 1 recent danceability 0.675
3 3 60 pop dance pop 1 recent danceability 0.718
4 4 69 pop dance pop 1 recent danceability 0.650
... ... ... ... ... ... ... ... ...
396979 32828 42 edm progressive electro house 0 recent release_month 4.000
396980 32829 20 edm progressive electro house 0 recent release_month 3.000
396981 32830 14 edm progressive electro house 0 recent release_month 4.000
396982 32831 15 edm progressive electro house 0 recent release_month 1.000
396983 32832 27 edm progressive electro house 0 recent release_month 3.000

396984 rows × 8 columns

💡 To visualize the continuous variables we will use Histograms and KDE plots

In [ ]:
sns.displot(data = df_lf, x='value', col='variable', kind='hist', kde=True,
            facet_kws={'sharex': False, 'sharey': False},
            common_bins=False,
            col_wrap=3)
plt.subplots_adjust(hspace=0.5)
plt.tight_layout()
plt.tight_layout()
plt.show()
No description has been provided for this image

💡 Observation¶

  • We can see that speechiness, instrumentalness, liveness, acousticness are skewed right.

  • loudness and mode are skewed left.

  • Only danceability, energy, valence, and tempo have approximately normal distributions.

📌 Before we can use the data for modeling, we must transform the left- and right-skewed features so that they have more symmetrical, bell-shaped distributions.
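One common option for the right-skewed, non-negative features (a sketch of my own, not a step the notebook performs) is a log1p transform, which compresses the long right tail while keeping 0 a valid input:

```python
import numpy as np
import pandas as pd

# Toy right-skewed values standing in for a feature like speechiness
x = pd.Series([0.03, 0.04, 0.05, 0.06, 0.50, 0.90])

# log1p(x) = log(1 + x): shrinks large values more than small ones
x_log = np.log1p(x)

# Skewness should move toward 0 after the transform
print(round(x.skew(), 2), '->', round(x_log.skew(), 2))
```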

Visualize Relationships¶

Categorical-to-categorical relationships¶

track_popularity_bin Vs playlist_genre¶

In [ ]:
sns.catplot( data = df, x='track_popularity_bin', hue='playlist_genre', kind='count' )

plt.show()
No description has been provided for this image

💡 The above visualization of track_popularity_bin against playlist_genre using a DODGED BAR CHART shows that edm is the most unpopular genre. Other genres have approximately the same number of tracks across both the unpopular and popular categories. We can also see that pop and rap are the most popular genres

playlist_subgenre Vs playlist_genre¶

Next, visualizing the relationship between playlist_subgenre and playlist_genre using HEATMAP

In [ ]:
fig, ax = plt.subplots(figsize=(20,10))

sns.heatmap( pd.crosstab( df.playlist_subgenre, df.playlist_genre ), ax = ax,
             annot=True, annot_kws={'size': 10}, fmt='d',
             cbar=False)

plt.show()

📌 The above heatmap shows that playlist_subgenre is strongly associated with playlist_genre: each subgenre falls under a single genre. Given this, the playlist_subgenre field can be dropped from the dataset, as highly correlated variables don't add much value to the final model.
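The strength of that association can be quantified with Cramér's V, a chi-square-based measure for two categorical variables (a sketch of my own; toy nested categories stand in for the genre columns, and the chi-square statistic is computed by hand without continuity correction):

```python
import numpy as np
import pandas as pd

# Toy nested categories standing in for playlist_genre / playlist_subgenre:
# each subgenre occurs under exactly one genre, so the association is perfect.
genre    = pd.Series(['pop', 'pop', 'rap', 'rap', 'edm', 'edm'])
subgenre = pd.Series(['dance pop', 'dance pop', 'trap', 'trap', 'big room', 'big room'])

ct = pd.crosstab(genre, subgenre).to_numpy().astype(float)

# Plain chi-square statistic, then Cramer's V = sqrt(chi2 / (n * min(r-1, k-1)))
n = ct.sum()
expected = np.outer(ct.sum(axis=1), ct.sum(axis=0)) / n
chi2 = ((ct - expected) ** 2 / expected).sum()
r, k = ct.shape
cramers_v = np.sqrt(chi2 / (n * min(r - 1, k - 1)))
print(round(cramers_v, 3))  # 1.0 for a perfectly nested pair
```

A value near 1 confirms the nesting seen in the heatmap and supports dropping playlist_subgenre.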

In [ ]:
df.drop(columns=['playlist_subgenre'], inplace=True)

Categorical to Continuous¶

Key Vs Track Popularity¶

In [ ]:
sns.catplot( data = df, x='key', y='track_popularity', kind='point', linestyle='none')

plt.show()
No description has been provided for this image

💡 Tracks with higher popularity tend to have a higher key value

Mode Vs Track Popularity¶

In [ ]:
sns.catplot( data = df, x='mode', y='track_popularity', kind='point', linestyle='none')

plt.show()
No description has been provided for this image

💡 Tracks with higher popularity tend to have 1 as the mode

Release Year Vs Track Popularity¶

In [ ]:
sns.boxplot(data=df, x="release_year_bin", y="track_popularity", showmeans=True)
Out[ ]:
<Axes: xlabel='release_year_bin', ylabel='track_popularity'>
No description has been provided for this image

The above plot tells us that recently released songs have higher popularity than songs released in earlier years. This is a good indication that release_year_bin can have an impact on popularity.

Release Month Vs Track Popularity¶

In [ ]:
sns.catplot(data=df, x="release_month", y="track_popularity", kind='point', aspect=2, linestyle='none')
Out[ ]:
<seaborn.axisgrid.FacetGrid at 0x7fd9edf9f6d0>
No description has been provided for this image

💡 Tracks that were released during the months of October, November and December tend to have higher popularity.

Continuous-to-Continuous Relationships¶

Corr plot for ALL variables¶

A correlation plot is one of the most effective ways to view the relationships between continuous variables

Below is the corr plot for the given dataset

In [ ]:
fig, ax = plt.subplots(figsize=(20,15))

sns.heatmap(data = df.select_dtypes('number').corr(numeric_only=True),
            vmin=-1, vmax=1, center=0,
            cmap='coolwarm', cbar=False,
            annot=True, annot_kws={'size': 12},
            ax=ax)
plt.tight_layout()
plt.show()
No description has been provided for this image

💡 The above plot reveals the following

  1. energy and loudness are highly positively correlated
  2. energy and acousticness are highly negatively correlated
  3. track_popularity has a low correlation with all other variables. This is good because we can use the other variables to "predict" track popularity
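Strongly correlated pairs like these can also be extracted programmatically instead of read off the heatmap (a sketch of my own; a toy frame stands in for df.select_dtypes('number'), and the 0.7 threshold is an illustrative choice):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df.select_dtypes('number')
tmp = pd.DataFrame({
    'energy':       [0.9, 0.8, 0.3, 0.2, 0.7],
    'loudness':     [-3.0, -4.0, -12.0, -14.0, -5.0],
    'acousticness': [0.05, 0.10, 0.80, 0.90, 0.15],
})

corr = tmp.corr()
# Keep only the upper triangle so each pair is listed once, then filter
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong = upper.stack()
strong = strong[strong.abs() > 0.7]
print(strong.round(2))
```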

Next, visualizing the relationships between few different continuous variables

energy Vs danceability¶

In [ ]:
sns.relplot(data = df, x='energy', y='danceability')

plt.show()
No description has been provided for this image

💡Tracks with higher energy tend to be more danceable

acousticness Vs loudness¶

In [ ]:
sns.relplot(data = df, x='acousticness', y='loudness')

plt.show()
No description has been provided for this image

💡 Tracks that are more acoustic tend to be less loud than tracks that are less acoustic

valence Vs danceability¶

In [ ]:
sns.relplot(data = df, x='valence', y='danceability')

plt.show()
No description has been provided for this image

💡 The danceability of tracks increases as valence increases

Visualize conditional distributions of the continuous inputs GROUPED BY the response (outcome) unique values¶

Here, the conditional distributions of the continuous inputs are visualized grouped by the response variable track_popularity_bin
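
The long-format `df_lf` used in the next cell is not constructed in this section; a minimal sketch of how it could be derived with `pd.melt`, assuming `df` already contains the `track_popularity_bin` column (`to_long_format` is a hypothetical helper):

```python
import pandas as pd

# Continuous audio features to reshape (names match the dataset's columns)
CONTINUOUS_INPUTS = ['danceability', 'energy', 'loudness', 'speechiness',
                     'acousticness', 'instrumentalness', 'liveness',
                     'valence', 'tempo', 'duration_ms']

def to_long_format(df, value_vars=CONTINUOUS_INPUTS):
    # Wide -> long: one row per (track, feature) pair, keeping the target column
    return df.melt(id_vars=['track_popularity_bin'],
                   value_vars=value_vars,
                   var_name='variable', value_name='value')

# df_lf = to_long_format(df)  # df is the notebook's cleaned dataframe
```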

In [ ]:
sns.displot(data = df_lf, x='value', col='variable', kind='kde',
            hue='track_popularity_bin',
            facet_kws={'sharex': False, 'sharey': False},
            common_norm=False,
            col_wrap=3
           )

plt.show()

💡 Observations

  • danceability of popular songs (track_popularity_bin=1) is higher than that of unpopular songs (track_popularity_bin=0)
  • acousticness of popular songs is lower than that of unpopular songs
  • popular songs tend to have a shorter duration than unpopular songs

Visualize conditional distributions of the continuous inputs GROUPED BY the response (outcome) variable and additional categorical variable¶

In [ ]:
sns.catplot(data = df_lf, x='track_popularity_bin', y='value', col='variable',
            hue='playlist_genre',
            kind='box',
            sharey=False,
            showmeans=True,
            col_wrap=3,
            meanprops={'marker': 'o', 'markerfacecolor': 'white', 'markeredgecolor': 'black'})
plt.show()

💡 Observations

  • The danceability score is higher for pop and rap tracks in the popular category than in the unpopular category
  • Tracks of the genre rock tend to be longer in duration than other genres
  • loudness is higher for all genres in the popular category than in the unpopular category
  • The distributions of the continuous variables vary by genre. Since these genre-related differences are already reflected in the continuous variables, they may be sufficient to determine the popularity of a track without relying on the genre

Visualize relationships between continuous inputs GROUPED BY the response (outcome) unique values¶

Here, SCATTER PLOTS are used to visualize the relationships between continuous inputs GROUPED BY the response (outcome) unique values

tempo Vs valence GROUPED BY track_popularity_bin¶

In [ ]:
sns.relplot(data=df, x='tempo', y='valence', hue='track_popularity_bin')

plt.show()

💡 Popular tracks (track_popularity_bin=1) tend to have higher tempo and valence compared to the unpopular tracks (track_popularity_bin=0)

tempo Vs danceability GROUPED BY track_popularity_bin¶

In [ ]:
sns.relplot(data=df, x='tempo', y='danceability', hue='track_popularity_bin')

plt.show()

💡Observations

  1. Danceability increases as tempo increases
  2. Popular songs (track_popularity_bin=1) tend to have higher tempo and danceability compared to the unpopular tracks (track_popularity_bin=0)
  3. Tempo and danceability together help discriminate popular tracks from unpopular ones

acousticness Vs loudness GROUPED BY track_popularity_bin¶

In [ ]:
sns.relplot(data=df, x='acousticness', y='loudness', hue='track_popularity_bin')

plt.show()

💡 Tracks that are more acoustic tend to be quieter

In [ ]:
sns.relplot(data=df, x='duration_ms', y='valence', hue='track_popularity_bin')

plt.show()

💡 Observations

  1. Popular tracks sound more positive (higher valence value) and these songs have a shorter duration
  2. Unpopular tracks sound more negative (lower valence value) and these songs have a longer duration

Next up, a PAIR PLOT is used to visualize the relationships between the remaining continuous variables, grouped by the target variable, that weren't discussed above

In [ ]:
sns.pairplot(data=df[[ 'energy', 'key', 'mode',
                      'speechiness', 'instrumentalness',
                      'liveness', 'track_popularity_bin']],
             hue='track_popularity_bin',
             diag_kws={'common_norm': False})

plt.show()

💡 Observations

  1. Tracks that have low speechiness tend to have high instrumentalness
  2. Some popular tracks have high speechiness as well as high energy. This could be due to tracks from the rap genre, which are speechy as well as highly energetic
  3. Tracks that have high liveness tend to have high energy. This could be because high-energy songs are more likely to be performed in front of a live audience
  4. Popular songs with a mode of 1 tend to have a high key value

Visualize the counts of combinations between the response (outcome) and categorical inputs¶

In [ ]:
sns.catplot(data=df, x='playlist_genre', hue='track_popularity_bin', col='release_year_bin', kind='count')

plt.show()

💡 Observations

  1. rock used to be the most popular genre in earlier years but has been overtaken by pop and rap in recent years
  2. edm is the most unpopular genre in recent years
In [ ]:
df.playlist_genre.value_counts(normalize=True)

K-Means Clustering¶

Clustering is an unsupervised machine learning technique designed to group unlabeled examples based on their similarity to each other. For this exercise, we will be using the K-Means method for clustering the given dataset

In [ ]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
In [ ]:
columns_to_use = ['danceability', 'energy', 'key', 'loudness', 'mode',
                  'speechiness', 'acousticness', 'instrumentalness',
                  'liveness', 'valence', 'tempo', 'duration_ms',
                  'track_popularity_bin', 'playlist_genre']
df_kmeans = df[columns_to_use]

Preprocessing¶

In [ ]:
df_kmeans.isna().sum()
Out[ ]:
danceability            0
energy                  0
key                     0
loudness                0
mode                    0
speechiness             0
acousticness            0
instrumentalness        0
liveness                0
valence                 0
tempo                   0
duration_ms             0
track_popularity_bin    0
playlist_genre          0
dtype: int64

💡 There are no MISSING VALUES in the dataset

In [ ]:
df_kmeans_features_clean = df_kmeans.select_dtypes('number').copy()
In [ ]:
sns.catplot(data = df_kmeans_features_clean, kind='box', aspect=2)

plt.show()

💡 Since one variable dominates in magnitude, the data has to be standardized first to remove the MAGNITUDE and SCALE effect. KMeans judges SIMILARITY by DISTANCE, and distance depends on MAGNITUDE and SCALE
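
A toy numeric example of the scale effect, with made-up (danceability, duration_ms) values: the raw Euclidean distance is driven almost entirely by duration, so a track with very different danceability can still look "close":

```python
import numpy as np

# Made-up feature vectors (danceability, duration_ms) for illustration
a = np.array([0.9, 210_000.0])
b = np.array([0.1, 210_500.0])   # very different danceability, similar duration
c = np.array([0.9, 240_000.0])   # identical danceability, longer duration

d_ab = np.linalg.norm(a - b)     # ~500: dominated by the small duration gap
d_ac = np.linalg.norm(a - c)     # 30000: "far" despite identical danceability
```

Without scaling, KMeans would group `a` with `b` rather than `c`, purely because of the duration magnitude.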

In [ ]:
# Using sklearn StandardScaler to standardize the dataset

X = StandardScaler().fit_transform(df_kmeans_features_clean)
In [ ]:
sns.catplot(data = pd.DataFrame(X, columns=df_kmeans_features_clean.columns), kind='box', aspect=2)

plt.show()

📌 Variables have now been standardized
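
As a quick sanity check, every standardized column should have mean ≈ 0 and standard deviation ≈ 1. A numpy-only sketch of the check on toy values (`StandardScaler` performs the same column-wise z-scoring by default):

```python
import numpy as np

# Toy raw matrix standing in for the unscaled feature table
raw = np.array([[0.8, 210_000.0],
                [0.4, 180_000.0],
                [0.6, 240_000.0]])

# Column-wise z-scoring: subtract the mean, divide by the population std
X_check = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Each standardized column now has mean ~0 and std ~1
assert np.allclose(X_check.mean(axis=0), 0.0)
assert np.allclose(X_check.std(axis=0), 1.0)
```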

Clustering¶

Starting with two clusters¶

In [ ]:
clusters_2 = KMeans(n_clusters=2, random_state=121, n_init=25, max_iter=500).fit_predict(X)
In [ ]:
df_kmeans_clean_copy = df_kmeans.copy()
In [ ]:
df_kmeans_clean_copy['k2'] = pd.Series(clusters_2, index=df_kmeans_clean_copy.index ).astype('category')
In [ ]:
df_kmeans_clean_copy.k2.value_counts()
Out[ ]:
k2
0    20078
1     8278
Name: count, dtype: int64
In [ ]:
fig, ax = plt.subplots(figsize=(20,5))

sns.heatmap(data = pd.crosstab(df_kmeans_clean_copy.track_popularity_bin,
                               df_kmeans_clean_copy.k2,
                               margins=True ),
            annot=True,
            annot_kws={"fontsize": 20},
            fmt='g',
            cbar=False,
            ax=ax)

plt.show()

The above heatmap tells us that most songs have ended up in cluster 0. This suggests that more clusters are needed to separate songs with the distinctive characteristics that land them in the popular and unpopular categories

Finding Optimal number of clusters¶

Here, the optimal number of clusters is found using the KNEE BEND PLOT

In [ ]:
tots_within = []

K = range(1, 15)

for k in K:
    km = KMeans(n_clusters=k, random_state=121, n_init=25, max_iter=500)
    km = km.fit(X)
    tots_within.append( km.inertia_ )
In [ ]:
fig, ax = plt.subplots()

ax.plot( K, tots_within, 'bo-' )
ax.set_xlabel('number of clusters')
ax.set_ylabel('total within sum of squares')

plt.show()

📌 Although there isn't a clean KNEE BEND here, the curve starts to bend around 5 clusters and flattens out before 8. So, I have decided to go with 7 clusters for further analysis

In [ ]:
clusters_7 = KMeans(n_clusters=7, random_state=121, n_init=25, max_iter=500).fit_predict(X)
In [ ]:
df_kmeans_clean_copy = df_kmeans.copy()
In [ ]:
df_kmeans_clean_copy['k7'] = pd.Series(clusters_7, index=df_kmeans_clean_copy.index ).astype('category')
In [ ]:
df_kmeans_clean_copy.k7.value_counts()
Out[ ]:
k7
3    6844
4    5963
2    5038
6    3510
5    3123
0    2211
1    1667
Name: count, dtype: int64
In [ ]:
fig, ax = plt.subplots(figsize=(20,5))

sns.heatmap(data = pd.crosstab(df_kmeans_clean_copy.track_popularity_bin,
                               df_kmeans_clean_copy.k7,
                               margins=True ),
            annot=True,
            annot_kws={"fontsize": 20},
            fmt='g',
            cbar=False,
            ax=ax)

plt.show()

💡 The above heatmap shows that clusters 0, 1, and 2 tend to have more unpopular songs than clusters 3, 4, 5, and 6. This separation is much better than with two clusters
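
The raw counts above are easier to compare as within-cluster shares. A sketch using `pd.crosstab` with `normalize='index'`, shown on a toy stand-in for `df_kmeans_clean_copy`:

```python
import pandas as pd

# Toy stand-in for df_kmeans_clean_copy: cluster label and binary popularity
toy = pd.DataFrame({'k7':                   [0, 0, 0, 1, 1, 2],
                    'track_popularity_bin': [0, 0, 1, 1, 1, 0]})

# normalize='index' converts counts to popular/unpopular shares per cluster
rates = pd.crosstab(toy.k7, toy.track_popularity_bin, normalize='index')
```

Applied to the real frame, this would show directly which of the seven clusters skew popular.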

Visualizing relationships and conditional distributions using PAIR PLOT¶

Finally, a PAIR PLOT is used to visualize the relationships between continuous variables GROUPED BY the cluster category, as well as the conditional distribution of each continuous variable GROUPED BY the cluster category

In [ ]:
# NOTE: I have used a sample of 5000 because my notebook kept crashing
sns.pairplot(data = df_kmeans_clean_copy.sample(5000), hue='k7', diag_kws={'common_norm': False},
             palette='viridis')

plt.show()

💡 Observations

  1. Tracks in clusters 0, 1, and 2 tend to have higher energy compared to clusters 3, 4, 5, and 6
  2. Tracks in cluster 5 have lower energy and duration_ms between 200000 and 300000
  3. Cluster 0 seems to be made of tracks with high instrumentalness and high energy

More observations will be documented in the second part of the assignment